The assigned dataset contains 32 variables and 1885 records from the UCI Machine Learning Repository. It contains information about the usage of different drugs (legal and illegal) and respondents' characteristics such as level of education, ethnicity, age and personality traits. Several classification problems could be solved with this dataset. With possible future applications in mind, it was decided to build a model that predicts the probability that a person with given characteristics has used any kind of drug in the past year. The characteristics used in the model should be readily available. Moreover, this report does not address which drug a person will use, only whether he/she will use any.
package & data import
from sklearn.neighbors import KernelDensity
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import tree
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import seaborn as sns; sns.set(color_codes=True)
from scipy import stats
import plotly.graph_objects as go  # needed for the interactive go.Figure charts below
data = pd.read_csv("drug_consumption.csv", sep=';', encoding='UTF-8')
data
changing column names
data.columns = ['id', 'age', 'gender', 'edu', 'country', 'ethnicity'
, 'Neuroticism', 'Extraversion', 'Openness',
'Agreeableness', 'Conscientiousness',
'impulsiveness', 'sensation'
,'Alcohol', 'Amphet', 'Amyl', 'Benzos'
, 'Caff', 'Cannabis', 'Choc', 'Coke'
,'Crack', 'Ectasy', 'Heroin', 'Ketamine'
, 'Legalh', 'LSD', 'Meth', 'Mushrooms', 'Nicotine', 'Semer', 'VSA']
data.describe()
data.describe(include=['O'])
The values of the analysed variables are difficult to interpret, hence their names and levels are changed as below.
Personality trait values are changed to standardised levels (4 levels each, where 1 is the lowest level of a given characteristic and 4 the highest) according to their frequency in the dataset.
for col in ['Neuroticism', 'Extraversion', 'Openness', 'Agreeableness', 'Conscientiousness', 'impulsiveness', 'sensation']:
    threshold_1 = data[col].quantile(q=0.25)
    threshold_2 = data[col].quantile(q=0.5)
    threshold_3 = data[col].quantile(q=0.75)
    data[col] = data[col].apply(lambda x: np.where(x <= threshold_1, 1, np.where(x <= threshold_2, 2, np.where(x <= threshold_3, 3, 4))))
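The same quartile binning can also be expressed with pandas' qcut; a minimal sketch on toy data (note that qcut derives the bin edges from quantiles itself, so results can differ from the manual thresholds when there are ties in the data):

```python
import pandas as pd

# Toy scores standing in for one personality-trait column.
scores = pd.Series([-2.1, -0.5, 0.0, 0.3, 1.1, 1.8, -1.2, 0.9])

# Split into 4 equal-frequency bins labelled 1 (lowest) to 4 (highest).
levels = pd.qcut(scores, q=4, labels=[1, 2, 3, 4]).astype(int)
```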
For the other variables (age, gender, education, country and ethnicity) the categorical codes are changed to category names, which are easier to interpret.
data['age'] = data['age'].apply(lambda x: np.where(round(x*1000, 0) == -952, '18-24',
                                np.where(round(x*1000, 0) == -79, '25-34',
                                np.where(round(x*1000, 0) == 498, '35-44',
                                np.where(round(x*1000, 0) == 1094, '45-54',
                                np.where(round(x*1000, 0) == 1822, '55-64',
                                np.where(round(x*1000, 0) == 2592, '65+', '')))))))
data['gender'] = data['gender'].apply(lambda x: np.where(round(x*1000, 0) == -482, 'male',
                                      np.where(round(x*1000, 0) == 482, 'female', '')))
data['edu'] = data['edu'].apply(lambda x: np.where(round(x*1000, 0) == -2436, 1,
                                np.where(round(x*1000, 0) == -1738, 2,
                                np.where(round(x*1000, 0) == -1437, 3,
                                np.where(round(x*1000, 0) == -1228, 4,
                                np.where(round(x*1000, 0) == -611, 5,
                                np.where(round(x*1000, 0) == -59, 6,
                                np.where(round(x*1000, 0) == 45, 7,
                                np.where(round(x*1000, 0) == 1164, 8,
                                np.where(round(x*1000, 0) == 1984, 9, ''))))))))))
data['country'] = data['country'].apply(lambda x: np.where(round(x*1000, 0) == -98, 'Australia',
                                        np.where(round(x*1000, 0) == 249, 'Canada',
                                        np.where(round(x*1000, 0) == -468, 'New Zealand',
                                        np.where(round(x*1000, 0) == -285, 'Other',
                                        np.where(round(x*1000, 0) == 211, 'Republic of Ireland',
                                        np.where(round(x*1000, 0) == 961, 'UK',
                                        np.where(round(x*1000, 0) == -570, 'USA', 'Other'))))))))
data['ethnicity'] = data['ethnicity'].apply(lambda x: np.where(round(x*1000, 0) == -502, 'Asian',
                                            np.where(round(x*1000, 0) == -1107, 'Black',
                                            np.where(round(x*1000, 0) == 1907, 'Mixed-Black/Asian',
                                            np.where(round(x*1000, 0) == 126, 'Mixed-White/Asian',
                                            np.where(round(x*1000, 0) == -222, 'Mixed-White/Black',
                                            np.where(round(x*1000, 0) == 114, 'Other',
                                            np.where(round(x*1000, 0) == -317, 'White', 'Other'))))))))
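As an alternative to the nested np.where chains, the rounded codes can be mapped through a dictionary; a sketch for gender only (the mapping values mirror the recoding above; codes absent from the dictionary become NaN rather than ''):

```python
import pandas as pd

# Rounded-code -> label mapping, as in the np.where chain for gender above.
gender_map = {-482: 'male', 482: 'female'}

# Toy raw codes standing in for the original standardized column.
codes = pd.Series([0.48246, -0.48246, 0.48246])
gender = (codes * 1000).round(0).astype(int).map(gender_map)
```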
For this modeling problem a person is considered a drug user if the drug was used within the last year or more recently; the exact time frame is not important. The drug variables are therefore transformed into binary ones.
for col in ['Alcohol', 'Amphet', 'Amyl', 'Benzos', 'Caff', 'Cannabis', 'Choc', 'Coke',
            'Crack', 'Ectasy', 'Heroin', 'Ketamine', 'Legalh', 'LSD', 'Meth',
            'Mushrooms', 'Nicotine', 'Semer', 'VSA']:
    data[col] = data[col].apply(lambda x: np.where(x in ['CL0', 'CL1', 'CL2'], 0, 1))
data.describe(include=['O'])
Looking at the non-numerical variables, it may be noticed that the dataset is not very diverse in terms of ethnicity: over 90% of respondents are white. The majority are from the United Kingdom, and the most frequent age group is 18-24.
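Shares like the ones quoted above can be checked with value_counts; a sketch on toy data standing in for the recoded ethnicity column:

```python
import pandas as pd

# Toy data standing in for data['ethnicity'] after the recoding above.
ethnicity = pd.Series(['White'] * 9 + ['Asian'])

# Normalized counts give the share of each category.
shares = ethnicity.value_counts(normalize=True)
```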
data.columns
data.iloc[:,6:13].describe()
data.iloc[:,14:32].describe()
Not surprisingly, over 90% of respondents have used caffeine or chocolate in the past year. What is more, over 50% have used nicotine or cannabis. More than 1 in 4 respondents have also used benzos, ecstasy and/or legal highs.
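Since the drug columns are now binary, the share of past-year users of each substance is simply the column mean; a sketch on a toy frame (the real shares come from data.describe() above):

```python
import pandas as pd

# Toy 0/1 frame standing in for the recoded drug columns.
df = pd.DataFrame({'Caff': [1, 1, 1, 0], 'Cannabis': [1, 1, 0, 0]})

# Mean of a binary column == share of users.
usage_share = df.mean()
```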
fig, axes = plt.subplots(3, 4, figsize=[15, 10])
num_cols = data[['Neuroticism', 'Extraversion', 'Openness', 'Agreeableness',
                 'Conscientiousness', 'impulsiveness', 'sensation', 'Alcohol', 'Caff',
                 'Cannabis', 'Choc', 'Nicotine']]
for i, col in enumerate(num_cols):
    sns.boxplot(x=col, color=np.random.rand(4,), data=data, ax=axes[i//4, i%4], fliersize=1, width=0.3)
plt.tight_layout()
As presented in the table above, alcohol, caffeine, cannabis, chocolate and nicotine are commonly used among respondents.
Target variable creation
A person is considered a drug user if he/she has used one of the drugs below in the past year:
Substances such as nicotine, alcohol, caffeine, cannabis and chocolate are excluded from the target variable because they are legal (in some of the countries surveyed) and/or commonly used (by over 50% of respondents). The variable Semer, a fake drug introduced to identify over-claimers, was dropped due to its low relevance to this particular modeling problem and its low frequency.
data = data.assign(drug_use_tmp = data.Amphet+ data.Amyl+ data.Benzos +data.Coke+ data.Crack+data.Ectasy+data.Heroin+data.Ketamine+data.Legalh+data.LSD+data.Meth+data.Mushrooms+data.VSA)
data = data.assign(drug_use = np.nan)
data['drug_use'] = data['drug_use_tmp'].apply(lambda x: 1 if x >0 else 0)
#Dropping columns that have been replaced or are irrelevant (Semer, a fake drug, is not helpful in this analysis)
data.drop(['id','drug_use_tmp', 'Amphet', 'Amyl', 'Benzos'
, 'Coke'
,'Crack', 'Ectasy', 'Heroin', 'Ketamine'
, 'Legalh', 'LSD', 'Meth', 'Mushrooms', 'Semer', 'VSA'],
axis = 1, inplace = True)
data['drug_use'].describe()
55% of observations are tagged as drug users. This means that the dataset is balanced and there is no need to oversample the data.
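The class balance can be read off directly from the target's normalized value counts; a sketch with a toy target series of a similar split:

```python
import pandas as pd

# Toy binary target with roughly the same 55/45 split as the real data.
drug_use = pd.Series([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# Share of each class; with a split this mild, no resampling is needed.
balance = drug_use.value_counts(normalize=True)
```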
#Replacing categorisation variables with dummies
data_cat = pd.get_dummies(data, columns=data.describe(include=['O']).columns, prefix = data.describe(include=['O']).columns)
fig, axes = plt.subplots(11, 4, figsize=[10, 20])
num_cols = data_cat.select_dtypes(include=np.number).columns
for i, col in enumerate(num_cols):
    sns.distplot(data_cat[col], ax=axes[i//4, i%4])
fig.delaxes(axes[3, 2])
plt.tight_layout()
Relationships with target variable
Correlation matrix
corr = data_cat.corr()
plt.rcParams['figure.figsize'] = [5,5]
ax = sns.heatmap(corr, square = True, cmap=sns.diverging_palette(20, 220, n=200))
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right');
Only significant correlations
import copy
#Calculating the p-value for each correlation
corr = data_cat.corr()
corr_new = copy.deepcopy(corr)
corr_p = copy.deepcopy(corr)
for i in np.arange(corr.shape[0]):
    for j in np.arange(corr.shape[1]):
        # pearsonr returns (correlation, p-value); keep the p-value
        corr_p.iloc[i, j] = stats.pearsonr(data_cat[corr.index.values[i]], data_cat[corr.columns[j]])[1]
indices = corr_p > 0.05
corr_new[indices] = 0
plt.rcParams['figure.figsize'] = [5,5]
ax = sns.heatmap(corr_new, square = True,
vmin = -1, vmax = 1, center = 0,
cmap=sns.diverging_palette(30, 150, n=200))
Since the number of variables is large and several correlations are visible in the charts above, the relationships are studied more carefully one by one. The variables were divided into 4 groups to ensure readability. Correlations with the target variable are presented below.
Group 1
num_cols = data_cat.iloc[:, 0:12].select_dtypes(include=np.number).columns
fig, axes = plt.subplots(4, 3, figsize=[15, 10])
for i, col in enumerate(num_cols):
    c_ax = axes[i//3, i%3]
    sns.regplot(x=col, y='drug_use', color=sns.color_palette("muted", n_colors=12)[i],
                data=data_cat, ax=c_ax)
    # linregress returns (slope, intercept, rvalue, pvalue, stderr); only the first two are used
    slope, intercept, *_ = stats.linregress(data_cat[col], data_cat['drug_use'])
    line = slope*data_cat[col] + intercept
    c_ax.plot(data_cat[col], line, 'black', alpha=0.5)
fig.delaxes(axes[3, 2])
plt.tight_layout()
Neuroticism, openness, impulsiveness, sensation and cannabis have a slightly positive impact on drug usage. Agreeableness and conscientiousness are the only variables in this group with a negative impact. Extraversion, alcohol, caffeine and chocolate consumption do not seem to have an effect on the target variable.
Group 2
num_cols = data_cat.iloc[:, 11:21].select_dtypes(include=np.number).columns
fig, axes = plt.subplots(4, 3, figsize=[15, 10])
for i, col in enumerate(num_cols):
    c_ax = axes[i//3, i%3]
    sns.regplot(x=col, y='drug_use', color=sns.color_palette("muted", n_colors=15)[i],
                data=data_cat, ax=c_ax)
    # linregress returns (slope, intercept, rvalue, pvalue, stderr); only the first two are used
    slope, intercept, *_ = stats.linregress(data_cat[col], data_cat['drug_use'])
    line = slope*data_cat[col] + intercept
    c_ax.plot(data_cat[col], line, 'black', alpha=0.5)
fig.delaxes(axes[3, 2])
plt.tight_layout()
Nicotine consumption, being young (below 24 years old) and being a man increase the probability of drug consumption. Being older (above 45 years old) and being female decrease it.
Group 3
num_cols = data_cat.iloc[:, 21:30].select_dtypes(include=np.number).columns
fig, axes = plt.subplots(5, 2, figsize=[15, 25])
for i, col in enumerate(num_cols):
    c_ax = axes[i//2, i%2]
    sns.regplot(x=col, y='drug_use', color=sns.color_palette("muted", n_colors=15)[i],
                data=data_cat, ax=c_ax)
    # linregress returns (slope, intercept, rvalue, pvalue, stderr); only the first two are used
    slope, intercept, *_ = stats.linregress(data_cat[col], data_cat['drug_use'])
    line = slope*data_cat[col] + intercept
    c_ax.plot(data_cat[col], line, 'black', alpha=0.5)
fig.delaxes(axes[4, 1])
plt.tight_layout()
A higher level of education favors a tendency towards drug usage up to a point: for levels 4 (left school at 18 years) and 5 (some college or university but no certificate or degree) the probability is higher than for the lower levels, but for level 6 and above (professional certificate/diploma, masters or doctorate degree) the probability decreases.
Group 4
num_cols = data_cat.iloc[:, 30:45].select_dtypes(include=np.number).columns
fig, axes = plt.subplots(5, 3, figsize=[15, 20])
for i, col in enumerate(num_cols):
    c_ax = axes[i//3, i%3]
    sns.regplot(x=col, y='drug_use', color=sns.color_palette("muted", n_colors=15)[i],
                data=data_cat, ax=c_ax)
    # linregress returns (slope, intercept, rvalue, pvalue, stderr); only the first two are used
    slope, intercept, *_ = stats.linregress(data_cat[col], data_cat['drug_use'])
    line = slope*data_cat[col] + intercept
    c_ax.plot(data_cat[col], line, 'black', alpha=0.5)
fig.delaxes(axes[3, 2])
plt.tight_layout()
The only country for which the probability of using drugs is lower is the United Kingdom. For Canada there is no correlation between the country and drug usage. People of Asian and Black ethnicity seem to use drugs less than the others, but those of mixed Black/Asian ethnicity have a higher probability.
Summary of EDA
Several of the variables studied seem to have a positive or negative effect on the target variable. Only the few presented below may be dropped, as there are no significant correlations between them and drug usage:
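The per-variable relationships discussed above can also be summarized by ranking correlations with the target; a sketch on a toy frame (in the notebook this would be data_cat.corr()['drug_use'].sort_values()):

```python
import pandas as pd

# Toy frame standing in for data_cat: two candidate predictors and the target.
df = pd.DataFrame({
    'Cannabis': [1, 1, 1, 0, 0],
    'Choc':     [1, 1, 1, 1, 0],
    'drug_use': [1, 1, 1, 0, 0],
})

# Correlation of each predictor with the target, strongest first.
target_corr = df.corr()['drug_use'].drop('drug_use').sort_values(ascending=False)
```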
data_clean = data_cat.copy()  # copy so that the inplace drop below does not mutate data_cat
data_clean.drop(['Caff', 'Choc', 'Alcohol', 'Extraversion'], axis=1, inplace=True)
#Splitting data into train and test
X = data_clean.drop(['drug_use'],axis=1)
y = data_clean['drug_use']
%time X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 1)
CART = tree.DecisionTreeRegressor(random_state=42,ccp_alpha=0.0)
CART_model=CART.fit(X_train,y_train)
path = CART.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas[::10], path.impurities[::10]
fig, ax = plt.subplots()
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set");
clfs = []
for ccp_alpha in ccp_alphas:
    clf = tree.DecisionTreeRegressor(random_state=42, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))
def RMSE(model, X, y):
    return np.sqrt(((model.predict(X) - y)**2).mean())
test_scores = [RMSE(clf,X_test,y_test) for clf in clfs]
train_scores = [RMSE(clf,X_train,y_train) for clf in clfs]
fig, ax = plt.subplots(figsize=[10,10])
ax.set_xlabel("alpha")
ax.set_ylabel("RMSE")
ax.set_title("RMSE vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
Best_CART = clfs[np.argmin(test_scores)]
Best_CART.ccp_alpha
min(test_scores)
The best regression tree (measured by the lowest root mean squared error) is produced by the complexity parameter ccp_alpha equal to 0.00136; its RMSE equals 0.3619.
confmat = pd.crosstab(Best_CART.predict(X_test).round(),y_test.round())
confmat
CART_accuracy = np.array([confmat.loc[i,i] for i in confmat.index]).sum()/confmat.sum().sum()
CART_accuracy
The accuracy of the model equals 84%.
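The same accuracy can be cross-checked with scikit-learn's accuracy_score on the rounded predictions; a minimal sketch with toy arrays (in the notebook the arguments would be y_test and Best_CART.predict(X_test).round()):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels and regression outputs; rounding turns the outputs into classes.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.2, 0.4, 0.8, 0.1]).round()

# Equivalent to summing the crosstab diagonal and dividing by the total.
acc = accuracy_score(y_true, y_pred)
```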
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
#Checking the influence of the number of trees on RMSE
rfr = RandomForestRegressor
N = [10, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000]
RMSE_RF = [RMSE(rfr(n_estimators=n, n_jobs=-1).fit(X_train, y_train), X_test, y_test) for n in N]
fig = go.Figure(go.Scatter(x=N, y=RMSE_RF))
fig.update_layout(
title="RMSE vs number of trees (random forest)",
xaxis_title="Number of trees",
yaxis_title="RMSE",
)
fig.show(renderer="notebook")
print('Number of trees minimizing RMSE: ', N[np.argmin(RMSE_RF)])
#Checking the influence of the number of features on RMSE
features = np.linspace(1, X_train.shape[1], 100).astype(int)
RMSE_RF_features = [RMSE(rfr(n_estimators=300, max_features=n, n_jobs=-1).fit(X_train, y_train), X_test, y_test) for n in features]
fig = go.Figure(go.Scatter(x=features, y=RMSE_RF_features))
fig.update_layout(
title="RMSE vs number of features (random forest)",
xaxis_title="Number of features",
yaxis_title="RMSE",
)
fig.show(renderer="notebook")
print('Number of features minimizing RMSE: ', features[np.argmin(RMSE_RF_features)])
Best_RF = RandomForestRegressor(n_estimators=300, max_features=2, n_jobs=-1).fit(X_train, y_train)
# Plot the feature importances of the forest
importances = Best_RF.feature_importances_
# use `est`, not `tree`, as the loop variable to avoid shadowing the imported sklearn tree module
std = np.std([est.feature_importances_ for est in Best_RF.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
num_feat = 3
plt.figure(figsize=[10,5])
plt.title("Feature importances")
plt.bar(range(num_feat), importances[indices][:num_feat],
        color="r", yerr=std[indices][:num_feat], align="center")
plt.xticks(range(num_feat), X_train.columns[indices][:num_feat])
plt.xlim([-1, num_feat])
plt.show()
Three variables that are important in predicting drug usage are:
confmat_RF = pd.crosstab(Best_RF.predict(X_test).round(),y_test.round())
confmat_RF
rf_accuracy = np.array([confmat_RF.loc[i,i] for i in confmat_RF.index]).sum()/confmat_RF.sum().sum()
rf_accuracy
The random forest model has an accuracy of 83.745%.
#Checking the influence of the number of boosting stages on RMSE
gbr = GradientBoostingRegressor
N = [10,15,20,25,30,35,40,45,50,70,100,200,300,400,500,600,700,800,900,1000]
RMSE_GBT = [RMSE(gbr(n_estimators=n, learning_rate=0.1, alpha=0.2).fit(X_train,y_train),X_test,y_test) for n in N]
fig = go.Figure(go.Scatter(x=N, y=RMSE_GBT))
fig.update_layout(
title="RMSE vs number of boosting stages (gradient boosting)",
xaxis_title="Number of boosting stages",
yaxis_title="RMSE"
)
fig.show(renderer="notebook")
print('Number of boosting stages minimizing RMSE: ', N[np.argmin(RMSE_GBT)])
Best_GBT = GradientBoostingRegressor(n_estimators=30,learning_rate=0.05).fit(X_train,y_train)
#Rounding the regression predictions lets us evaluate this as a classification task
confmat_GBT = pd.crosstab(Best_GBT.predict(X_test).round(),y_test.round())
confmat_GBT
#Accuracy
gbt_accuracy = np.array([confmat_GBT.loc[i,i] for i in confmat_GBT.index]).sum()/confmat_GBT.sum().sum()
gbt_accuracy
# Comparison of deviance between training and test set
test_score = np.zeros((30,), dtype=np.float64)
for i, y_pred in enumerate(Best_GBT.staged_predict(X_test)):
    test_score[i] = Best_GBT.loss_(y_test, y_pred)
plt.figure(figsize=(10,5))
plt.title('Deviance')
plt.plot(np.arange(30) + 1, Best_GBT.train_score_, 'b-',
         label='Training Set Deviance')
plt.plot(np.arange(30) + 1, test_score, 'r-',
         label='Test Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance');
# Plot feature importance
feature_importance = Best_GBT.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
num_feat = 6
plt.figure(figsize=[10,5])
plt.barh(pos[-num_feat:], feature_importance[sorted_idx][-num_feat:], align='center')
plt.yticks(pos[-num_feat:], X_train.columns[sorted_idx][-num_feat:])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
As with the random forest, cannabis is the most important variable. The second most important are country (USA) and country (UK).
models = [Best_CART, Best_RF, Best_GBT]
errors = [RMSE(m, X_test, y_test) for m in models]
plt.bar(['CART','Random Forest','Gradient Boosted Trees'], errors, color=['red', 'green', 'blue'], alpha=0.75);
acc = pd.DataFrame((CART_accuracy, rf_accuracy, gbt_accuracy),['CART', 'Random Forest', 'Gradient Boosted Trees'])
acc.columns = ['accuracy']
acc
All three models perform at a similar level. The lowest root mean squared error was achieved by the random forest model (0.3517), but the highest accuracy was achieved by the CART model (84.1%).
Three models have been produced to predict the probability that a person with given characteristics had used drugs in the past year, using decision tree, random forest and gradient boosted trees algorithms. It turned out that each of them gives similar results, with an accuracy of about 84%.
The report encountered two main problems. First, the dataset contained several variables, including personality traits, whose values were very difficult to interpret. Relying on the documentation, they were transformed into 4 standardised levels based on their frequency in the dataset. Second, the models were trained in a way that avoids overfitting: the decision tree was pruned to ensure the final model minimizes root mean squared error on the test dataset, and the random forest parameters (number of trees and number of features) as well as the number of boosting stages in the gradient boosted trees model were chosen by the same measure.
Several problems were identified for future research. First, there is little information about the origins of the dataset, and after data pre-processing it turned out that over half of the respondents had used one of the drugs in the past year, which raises the suspicion that respondents were selected in some specific way rather than being a random sample from society (the ratio of drug users is surprisingly high). What is more, the majority of respondents are of white ethnicity, which can lead to biased results. It may be useful to repeat the analysis with a greater variety of respondents, especially since it has been shown that some ethnicities have a higher probability of using drugs. Furthermore, it has been assumed that the characteristics used in the model will be readily available. Unfortunately, the most important variable in the models is cannabis consumption, which will cause difficulties in future predictions for obvious reasons.
The models do not present causes of drug consumption but characteristics that are correlated with it. This means one should be very careful when interpreting the results and their applications. The model can support decisions regarding, for example, the implementation of preventive actions such as information campaigns targeted at specific groups of people. It must not be used to support decisions regarding specific individuals, such as in recruitment.